Introduction

I expected to compete in my first Kaggle competitions this March - both the NCAA Men's and Women's basketball tournaments. Unfortunately, both tournaments were cancelled due to the Covid-19 pandemic. Regardless, working through my different ideas was a great way to learn some of the basics and nuances of training an ML model effectively.

The typical successful model on Kaggle is an XGBoost ensemble model aimed at classifying wins and losses of tournament games. The biggest differences generally come down to feature selection and engineering. A key advantage of neural networks is the ability of a large network to learn non-linear relationships, which ultimately limits the need for complex feature engineering. However, training and test data are limited in this case. The most detailed data sets extend back to 2003, which gives ~1,140 games, many of which have spurious results that even the best models would not predict. As expected, a neural network architecture without feature engineering performs much more poorly on this data set; a network with feature engineering performs similarly to an XGBoost model. In future posts, I will explore more complex deep learning architectures such as RNNs for tournament prediction, but here I'll focus on feature engineering.

The easiest way to generate features for tournament prediction is to average a team's regular season statistics. The Kaggle data set contains various statistics, but users also commonly generate advanced statistics before aggregation. The aggregated features are useful, but do not compensate for opponent strength. If a team had an easy schedule, it may have artificially higher statistics than another. Some methods, like the Bradley-Terry model (implemented in Python here), solve for single-value team strengths using previous comparisons of results across all teams. This implementation solves for team strengths based only on wins and losses and cannot distinguish between different aspects of a team's strength. But what if we generalize this concept of pairwise comparisons using embeddings?
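To make the single-strength idea concrete, here is a minimal sketch of a Bradley-Terry fit using the classic minorization-maximization update. The function name `fit_bradley_terry`, the toy win counts, and the iteration count are illustrative assumptions, not the implementation linked above.

```python
import numpy as np

def fit_bradley_terry(wins, n_iter=200):
    """Estimate one strength per team from a matrix of pairwise win counts.

    wins[i, j] = number of times team i beat team j (hypothetical toy input).
    MM update: p_i <- W_i / sum_j [ n_ij / (p_i + p_j) ]
    """
    wins = np.asarray(wins, dtype=float)
    n_teams = wins.shape[0]
    games = wins + wins.T            # n_ij: games played between i and j
    total_wins = wins.sum(axis=1)    # W_i: total wins of team i
    p = np.ones(n_teams)
    for _ in range(n_iter):
        denom = games / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)  # a team never plays itself
        p = total_wins / denom.sum(axis=1)
        p = p / p.sum()               # strengths are only defined up to scale
    return p

# Toy round robin: team 0 usually beats teams 1 and 2, team 1 usually beats team 2
toy_wins = np.array([[0, 3, 4],
                     [1, 0, 3],
                     [0, 1, 0]])
strengths = fit_bradley_terry(toy_wins)

# Modelled probability that team 0 beats team 1
p_01 = strengths[0] / (strengths[0] + strengths[1])
print(strengths, p_01)
```

Each team ends up with a single number, so the model can rank teams but has no room to express *how* a team is strong. That limitation is what the embedding approach below tries to relax.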

This notebook will generate team-level embeddings, representative of regular season data (win/losses, point differential, court location) that could be used to train a tournament model. A simple exploratory analysis suggests that the trained embeddings are a richer representation of the original feature data set that could be implemented in a tournament model. Future testing and expansion of this method to advanced statistics will be needed to confirm that!

What you will see in this notebook at a high level:

  • Brief data prep - we are only using wins/losses, points, home/away, and team IDs as inputs to the model. Later, I will expand this model to advanced statistics, but training the model on this subset of data allows us to test the concept all the way back to 1985!
  • Model build - This model is being built with the sole purpose of generating useful embeddings. To achieve that we are training the model to be predictive of features that we would ordinarily use as feature inputs to a real tournament model (in this case, regular season wins and losses).
  • Training and validation - the model is trained using only regular season data from all years and is validated on a secondary set of tournament data (NIT). This is difficult because we have a slight mismatch between our training and validation data. The validation data is generally similar and likely more representative of the real NCAA tournament. That is okay; the end goal of this model is trained embeddings and not win prediction.
  • Sense check and exploratory analysis - The first step is to check that predictions from the model are sensible, but what we really care about is the embeddings. Do they carry more useful information than simple aggregations of the data they represent? In short, yes!

Note: This work was inspired by this Kaggle notebook, which is the first basketball application I've seen of this concept. Here is a second example of embeddings used for baseball and the concept could easily be applied to other sports.

Packages and Data

I'll be implementing this in Keras. My previous attempt using FastAI was quick and easy. Using embeddings for categorical data made the FastAI model a bit more elegant than XGBoost. However, we need two input variables (team 1 and team 2) to call the same embeddings matrix in this solution. FastAI can't do that out of the box and so I get to venture into the world of building my own in Keras. I plan on going one step deeper and building my final tournament model with TensorFlow.

#collapse_hide
from pathlib import Path
import numpy as np
import pandas as pd
np.random.seed(13)
import tensorflow as tf
import keras as k
from keras.models import Model
from keras.layers import Dense, Input, Dropout, Activation, Multiply, Lambda, Concatenate, Subtract, Flatten
from keras.layers import Embedding
from keras.initializers import glorot_uniform, glorot_normal
from keras.optimizers import Adam
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.manifold import TSNE
import altair as alt
np.random.seed(13)
Using TensorFlow backend.

I will be training the embeddings using total point and point differential from each game. Because the training doesn't require the more detailed NCAA data set, we can train using NCAA data all the way back to 1985. Hopefully this will make the weights of the other layers more robust. Depending on the embedding results, the final tournament model could also be trained back to 1985. Let's preview the first few rows of that regular season data here:

#collapse_hide
dataLoc=Path('./data/2020-05-04-NCAA-Embeddings/google-cloud-ncaa-march-madness-2020-division-1-mens-tournament/MDataFiles_Stage2/')

df_teams = pd.read_csv(dataLoc/'MTeams.csv')
teams_dict = df_teams[['TeamID','TeamName']].set_index('TeamID').to_dict()['TeamName']

df_regSeason_data = pd.read_csv(dataLoc/'MRegularSeasonCompactResults.csv')
df_regSeason_data.head() # cols = Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT
Season DayNum WTeamID WScore LTeamID LScore WLoc NumOT
0 1985 20 1228 81 1328 64 N 0
1 1985 25 1106 77 1354 70 H 0
2 1985 25 1112 63 1223 56 H 0
3 1985 25 1165 70 1432 54 H 0
4 1985 25 1192 86 1447 74 H 0

I want to be able to validate that the embedding training is going in the right direction. For the embedding training I will use the secondary tournament data. This allows us to avoid using the NCAA tournament data that we need for training/testing later, but still get a sense that the embeddings are useful. Here is a preview of that data:

#collapse_hide
df_otherTourney_data = pd.read_csv(dataLoc/'MSecondaryTourneyCompactResults.csv').drop(columns='SecondaryTourney')
df_otherTourney_data.head() # cols = Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT
Season DayNum WTeamID WScore LTeamID LScore WLoc NumOT
0 1985 136 1151 67 1155 65 H 0
1 1985 136 1153 77 1245 61 H 0
2 1985 136 1201 79 1365 76 H 0
3 1985 136 1231 79 1139 57 H 0
4 1985 136 1249 78 1222 71 H 0

Embeddings will be defined by the columns 'Season', 'WTeamID', and 'LTeamID'. 'WScore' and 'LScore' will be augmented slightly to be the predicted values and the game location will also be included as an embedding.

#collapse_hide
# Create team encoding that differentiates teams by year and school
def newTeamID(df):
    # df = df.sample(frac=1).reset_index(drop=True)
    df['Wnewid'] = df['Season'].astype(str) + df['WTeamID'].astype(str)
    df['Lnewid'] = df['Season'].astype(str) + df['LTeamID'].astype(str)
    return df

df_regSeason_data = newTeamID(df_regSeason_data)
df_otherTourney_data = newTeamID(df_otherTourney_data)

def idDicts(df):
    newid_W = list(df['Wnewid'].unique())
    newid_L = list(df['Lnewid'].unique())
    ids = list(set().union(newid_W,newid_L))
    ids.sort()
    oh_to_id = {}
    id_to_oh = {}
    for i in range(len(ids)):
        id_to_oh[ids[i]] = i 
        oh_to_id[i] = ids[i]

    return oh_to_id, id_to_oh

oh_to_id, id_to_oh = idDicts(df_regSeason_data)    

# add training data in swapped format so network sees both wins and losses
def swapConcat_data(df):

    df['Wnewid'] = df['Wnewid'].apply(lambda x: id_to_oh[x])
    df['Lnewid'] = df['Lnewid'].apply(lambda x: id_to_oh[x])

    loc_dict = {'A':-1,'N':0,'H':1}
    df['WLoc'] = df['WLoc'].apply(lambda x: loc_dict[x])

    swap_cols = ['Season', 'DayNum', 'LTeamID', 'LScore', 'WTeamID', 'WScore', 'WLoc', 'NumOT', 'Lnewid', 'Wnewid']

    df_swap = df[swap_cols].copy()

    df_swap['WLoc'] = df_swap['WLoc']*-1

    df.columns = [x.replace('WLoc','T1_Court')
                   .replace('W','T1_')
                   .replace('L','T2_') for x in list(df.columns)]

    df_swap.columns = df.columns

    df = pd.concat([df,df_swap])

    df['Win'] = (df['T1_Score']>df['T2_Score']).astype(int)
    df['Close_Game']= abs(df['T1_Score']-df['T2_Score']) <3
    df['Score_diff'] = df['T1_Score'] - df['T2_Score']
    df['Score_diff'] = df['Score_diff'] - (df['Score_diff']/df['Score_diff'].abs())
    df['T2_Court'] = df['T1_Court']*-1
    df[['T1_Court','T2_Court']] = df[['T1_Court','T2_Court']] + 1

    cols = df.columns.to_list()

    df = df[cols].sort_index()
    df.reset_index(drop=True,inplace=True)


    return df

df_regSeason_full = swapConcat_data(df_regSeason_data.copy().sort_values(by='DayNum'))
df_otherTourney_full = swapConcat_data(df_otherTourney_data.copy())

# Convert to numpy arrays in correct format
def prep_inputs(df,id_to_oh, col_outputs):
    Xteams = np.stack([df['T1_newid'].values,df['T2_newid'].values]).T
    Xloc = np.stack([df['T1_Court'].values,df['T2_Court'].values]).T

    if len(col_outputs) <2:
        Y_outputs = df[col_outputs].values
        Y_outputs = Y_outputs.reshape(len(Y_outputs),1)
    else:
        Y_outputs = np.stack([df[x].values for x in col_outputs])

    return [Xteams, Xloc], Y_outputs

X_train, Y_train = prep_inputs(df_regSeason_full, id_to_oh, ['Win','Score_diff'])
X_test, Y_test = prep_inputs(df_otherTourney_full, id_to_oh, ['Win','Score_diff'])

# Normalize point outputs - Win/loss unchanged
def normalize_outputs(Y_outputs, stats_cache=None):
    if stats_cache is None:
        # No cached statistics supplied: compute them from this data set (training case)
        stats_cache = {}
        stats_cache['mean'] = np.mean(Y_outputs,axis=1)
        stats_cache['var'] = np.var(Y_outputs,axis=1)
    
    numOut = Y_outputs.shape[0]
    Y_normout = (Y_outputs-stats_cache['mean'].reshape((numOut,1)))/stats_cache['var'].reshape((numOut,1))

    return Y_normout, stats_cache

Y_norm_train, stats_cache_train = normalize_outputs(Y_train,None)
Y_norm_test, _ = normalize_outputs(Y_test,stats_cache_train)
Y_norm_train[0,:] = Y_train[0,:]
Y_norm_test[0,:] = Y_test[0,:]

Building the model

This model is built with two input types: home/away flags and team IDs. Each input is repeated for each team and is fed through a location embedding layer and a team embedding layer. A school's embeddings are separate from season to season (e.g. Duke 2019 $\neq$ Duke 2020). It would be nice to carry some dependency from year to year, but that is completely disregarded here for simplicity. The location embedding is 1-dimensional and is multiplied element by element with each team's embedding vector. The two team embeddings are then fed separately through the same two dense layers before being subtracted. This subtracted layer finally connects to two output layers: one with a sigmoid activation for win/loss prediction and one dense layer with no activation for point prediction.

#collapse_show
# build model

tf.keras.backend.clear_session()

def NCAA_Embeddings_Joint(nteams,teamEmb_size):
    team_input = Input(shape=[2,],dtype='int32', name='team_input')
    X_team = Embedding(input_dim=nteams, output_dim=teamEmb_size, input_length=2, embeddings_initializer=glorot_uniform(), name='team_encoding')(team_input)

    loc_input = Input(shape=[2,],dtype='int32', name='loc_input')
    X_loc = Embedding(input_dim=3, output_dim=1, input_length=2, embeddings_initializer=glorot_uniform(), name='loc_encoding')(loc_input)
    X_loc = Lambda(lambda z: k.backend.repeat_elements(z, rep=teamEmb_size, axis=-1))(X_loc)
    
    X = Multiply()([X_team,X_loc])
    X = Dropout(rate=.5)(X)
    T1 = Lambda(lambda z: z[:,0,:])(X)
    T2 = Lambda(lambda z: z[:,1,:])(X)

    D1 = Dense(units = 20, use_bias=True, activation='tanh')
    DO1 = Dropout(rate=.5)

    D2 = Dense(units = 10, use_bias=True, activation='tanh')
    DO2 = Dropout(rate=.5)

    X1 = D1(T1)
    X1 = DO1(X1)

    X1 = D2(X1)
    X1 = DO2(X1)

    X2 = D1(T2)
    X2 = DO1(X2)

    X2 = D2(X2)
    X2 = DO2(X2)

    X_sub = Subtract()([X1,X2])

    output_p= Dense(units = 1, use_bias=False, activation=None, name='point_output')(X_sub)
    output_w= Dense(units = 1, use_bias=False, activation='sigmoid', name='win_output')(X_sub)


    model = Model(inputs=[team_input, loc_input],outputs=[output_w,output_p],name='ncaa_embeddings_joint')

    return model

mymodel = NCAA_Embeddings_Joint(len(id_to_oh),8)
mymodel.summary()
Model: "ncaa_embeddings_joint"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
loc_input (InputLayer)          (None, 2)            0                                            
__________________________________________________________________________________________________
team_input (InputLayer)         (None, 2)            0                                            
__________________________________________________________________________________________________
loc_encoding (Embedding)        (None, 2, 1)         3           loc_input[0][0]                  
__________________________________________________________________________________________________
team_encoding (Embedding)       (None, 2, 8)         92752       team_input[0][0]                 
__________________________________________________________________________________________________
lambda_1 (Lambda)               (None, 2, 8)         0           loc_encoding[0][0]               
__________________________________________________________________________________________________
multiply_1 (Multiply)           (None, 2, 8)         0           team_encoding[0][0]              
                                                                 lambda_1[0][0]                   
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 2, 8)         0           multiply_1[0][0]                 
__________________________________________________________________________________________________
lambda_2 (Lambda)               (None, 8)            0           dropout_1[0][0]                  
__________________________________________________________________________________________________
lambda_3 (Lambda)               (None, 8)            0           dropout_1[0][0]                  
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 20)           180         lambda_2[0][0]                   
                                                                 lambda_3[0][0]                   
__________________________________________________________________________________________________
dropout_2 (Dropout)             (None, 20)           0           dense_1[0][0]                    
                                                                 dense_1[1][0]                    
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 10)           210         dropout_2[0][0]                  
                                                                 dropout_2[1][0]                  
__________________________________________________________________________________________________
dropout_3 (Dropout)             (None, 10)           0           dense_2[0][0]                    
                                                                 dense_2[1][0]                    
__________________________________________________________________________________________________
subtract_1 (Subtract)           (None, 10)           0           dropout_3[0][0]                  
                                                                 dropout_3[1][0]                  
__________________________________________________________________________________________________
win_output (Dense)              (None, 1)            10          subtract_1[0][0]                 
__________________________________________________________________________________________________
point_output (Dense)            (None, 1)            10          subtract_1[0][0]                 
==================================================================================================
Total params: 93,165
Trainable params: 93,165
Non-trainable params: 0
__________________________________________________________________________________________________

Training the model

The model is trained using regular season data and validated using secondary tournament data (not the 'Big Dance'). The weights of the two losses are adjusted so that they propagate a similar amount of error backward. Because the point differential data has been normalized, its losses are multiple orders of magnitude smaller than the log loss for wins/losses.
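A rough way to see why the two loss weights differ so much is to compute both losses by hand on a few made-up predictions. The numbers below are illustrative assumptions (not values from this model) and the helper names are hypothetical; the formulas are the usual binary cross-entropy and log-cosh definitions.

```python
import numpy as np

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Mean log loss for binary targets and predicted probabilities."""
    p = np.clip(p_pred, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

def logcosh(y_true, y_pred):
    """Mean log(cosh(error)); roughly error**2 / 2 for small errors."""
    return float(np.mean(np.log(np.cosh(y_pred - y_true))))

# Made-up win labels and predicted win probabilities
win_true = np.array([1, 0, 1, 0])
win_pred = np.array([0.7, 0.4, 0.6, 0.2])

# Made-up normalized point differentials (small values after normalization)
diff_true = np.array([0.05, -0.03, 0.02, -0.08])
diff_pred = np.array([0.02, -0.01, 0.03, -0.04])

bce = binary_crossentropy(win_true, win_pred)
lch = logcosh(diff_true, diff_pred)

# Same weights as passed to compile(): 0.5 for win/loss, 400 for points
weighted_win = 0.5 * bce
weighted_point = 400 * lch
print(bce, lch, weighted_win, weighted_point)
```

With inputs of this scale the raw log-cosh loss is hundreds of times smaller than the cross-entropy, and the weights bring the two contributions to the same order of magnitude.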

#collapse_show
# Joint model
optimizer = Adam(learning_rate=.01, beta_1=0.9, beta_2=0.999, amsgrad=False)
mymodel.compile(loss=['binary_crossentropy','logcosh'],
                loss_weights=[0.5,400],
                optimizer=optimizer,
                metrics = ['accuracy'])
numBatch = round(X_train[0].shape[0]/50)
results = mymodel.fit(X_train, [*Y_norm_train], validation_data=(X_test, [*Y_norm_test]), epochs = 30, batch_size = numBatch,shuffle=True, verbose=False)

#collapse_hide
accuracy = results.history['win_output_accuracy']
val_accuracy = results.history['val_win_output_accuracy']
loss = results.history['win_output_loss']
val_loss = results.history['val_win_output_loss']
# summarize history for accuracy
plt.plot(accuracy)
plt.plot(val_accuracy)
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(loss)
plt.plot(val_loss)
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

Results

The crossplots for point differential and win/loss are generally well behaved. The loss and accuracy of the model are not great in comparison to results from the Kaggle competition. For reference, anything below a loss of 0.5 would be considered fantastic, and flipping a coin would give you a loss of about 0.69. We see a similar effect in the point spread prediction, with a rather loose correlation of 0.46.
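The coin-flip reference point can be checked directly: predicting 0.5 for every game gives a log loss of $-\ln(0.5) \approx 0.693$, whatever the outcomes are. A quick sketch (the `log_loss` helper and the toy outcomes are illustrative):

```python
import numpy as np

def log_loss(y_true, p_pred):
    """Mean binary log loss for outcomes in {0, 1} and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return float(np.mean(-(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))))

outcomes = [1, 0, 0, 1, 1]                      # arbitrary toy results
coin_flip_loss = log_loss(outcomes, [0.5] * len(outcomes))
print(round(coin_flip_loss, 3))  # 0.693
```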

If the goal of this project was to have the best model for predicting the winner of an NCAA tournament game, we would be failing (especially considering that only the best teams play in the tournament, which makes predictions even harder). However, the goal here was to train embeddings, not to get accurate predictions. Instead, we are using regular season data to train an embedding set that is representative of each team. We have only trained on wins/losses and points in this case, which might limit the utility of the features. Conversely, we will see in the next section that we have achieved a richer representation of the raw win/loss data than simply aggregating by teams.

#collapse_hide
def transform_y(preds,stats_cache):
    preds = stats_cache['var'][1] * preds + stats_cache['mean'][1]
    return preds

preds = mymodel.predict(X_test)

x = transform_y(preds[1],stats_cache_train).reshape(-1)
y = transform_y(Y_norm_test[1],stats_cache_train).reshape(-1)


print('Pearson coefficient: ', round(stats.pearsonr(x, y)[0]*100)/100)
plt.scatter(x, y, alpha=0.08)
plt.xlabel('Predicted point difference')
plt.ylabel('Actual point difference')
plt.show()
Pearson coefficient:  0.46

#collapse_hide
x = transform_y(preds[1],stats_cache_train).reshape(-1)
y = preds[0].reshape(-1)


print('Pearson coefficient: ', round(stats.pearsonr(x, y)[0]*100)/100)
plt.scatter(x, y, alpha=0.08)
plt.xlabel('Predicted point difference')
plt.ylabel('Predicted Win Probability')
plt.show()
Pearson coefficient:  1.0

#collapse_hide
plt.hist(y,bins=100)
plt.xlabel('Predicted Win Probability')
plt.ylabel('Count')
plt.show()

One notable aspect of the point prediction result is that the predictions are perfectly symmetrical. The network is able to give consistent predictions for "Team A vs. Team B" and "Team B vs. Team A" because the neural network is set up to treat the input features consistently for each team. Other ML models, such as XGBoost, treat the feature inputs of Team 1 and Team 2 differently, which can result in varying predictions when the teams are swapped. This can be an issue even when training sets contain matchups and swapped matchups as documented in this discussion.
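Why the symmetry holds can be seen in a few lines of NumPy: if both teams pass through the same function and the output heads have no bias, swapping the teams flips the sign of the point output and maps the win probability p to 1 - p. The sketch below uses random stand-in weights and a hypothetical `shared_tower` helper rather than the trained Keras model, and it ignores the location embedding and dropout.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in parameters with the same shapes as the model above (random, not trained)
emb = rng.normal(size=(5, 8))                      # 5 teams, 8-dimensional embeddings
W1, b1 = rng.normal(size=(8, 20)), rng.normal(size=20)
W2, b2 = rng.normal(size=(20, 10)), rng.normal(size=10)
w_point = rng.normal(size=10)                      # bias-free point head
w_win = rng.normal(size=10)                        # bias-free win head

def shared_tower(x):
    """Two tanh dense layers applied identically to either team."""
    return np.tanh(np.tanh(x @ W1 + b1) @ W2 + b2)

def predict(team_a, team_b):
    diff = shared_tower(emb[team_a]) - shared_tower(emb[team_b])
    point = diff @ w_point
    win = 1.0 / (1.0 + np.exp(-(diff @ w_win)))    # sigmoid
    return point, win

point_ab, win_ab = predict(0, 1)
point_ba, win_ba = predict(1, 0)
print(np.isclose(point_ab, -point_ba), np.isclose(win_ab + win_ba, 1.0))  # True True
```

The key ingredients are the shared weights, the subtraction, and `use_bias=False` on the output layers; a bias term on either head would break the exact symmetry.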

Exploratory Analysis

Let's take a look at some comparisons between our embeddings (mapped non-linearly into 2D space by T-SNE) vs. the aggregated point differential and win percentage of each team. All of these plots (excluding the final plot for 2020) will only include teams that were included in the NCAA tournament that year.

We'll color the scatter plots by a few different factors:

  • Highlighting tournament winners
  • Tournament seed number
  • Number of tournament games won
  • Some of the biggest upsets from these two articles -> (1, 2).

Here is a preview of the data that will be fed into the visualizations:

#collapse_hide
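# Look the layer up by name rather than by position so the lookup survives architecture changes
embeddings = mymodel.get_layer('team_encoding').get_weights()[0]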

t = TSNE(n_components=2)
embed_tsne = t.fit_transform(embeddings)

df_regSeason_full['T1_TeamName'] = df_regSeason_full['T1_TeamID'].apply(lambda x: teams_dict[x]) + '-' + df_regSeason_full['Season'].astype(str)
df_agg=df_regSeason_full.groupby('T1_TeamName').mean()
df_agg.reset_index(inplace=True,drop=False)

df_agg[['T1_TeamName','Win','Score_diff']]
df_agg.drop(columns='Season',inplace=True)

df_tourney_data = pd.read_csv(dataLoc/'MNCAATourneyCompactResults.csv')
df_tourney_data['WTeamName'] = df_tourney_data['WTeamID'].apply(lambda x: teams_dict[x]) + '-' + df_tourney_data['Season'].astype(str)
df_tourney_data['Wins'] = 0
df_wins = df_tourney_data[['WTeamName','Wins']].groupby('WTeamName').count()
tourneyWinners = [df_tourney_data.loc[df_tourney_data['Season']==s,'WTeamName'].values[-1] for s in df_tourney_data['Season'].unique()]

df_seeds = pd.read_csv(dataLoc/'MNCAATourneySeeds.csv')
df_seeds['TeamName'] = df_seeds['TeamID'].apply(lambda x: teams_dict[x]) + '-' + df_seeds['Season'].astype(str)
df_seeds['Seed'] = df_seeds['Seed'].str.extract(r'(\d+)')
df_seeds['WonTourney'] = df_seeds['TeamName'].apply(lambda x: True if x in tourneyWinners else False)
df_seeds = df_seeds[['TeamName','Seed','WonTourney']]

df_upsets = pd.read_csv('./data/2020-05-04-NCAA-Embeddings/Upsets.csv')
df_upsets['David']=df_upsets['David']+'-'+df_upsets['Season'].astype(str)
df_upsets['Goliath']=df_upsets['Goliath']+'-'+df_upsets['Season'].astype(str)
upsets = {}
for ii in df_upsets['David'].unique():
    upsets[ii] = 'Surprise'
for ii in df_upsets['Goliath'].unique():
    upsets[ii] = 'Bust'
df_seeds = pd.merge(left=df_seeds, right=df_wins, how='left', left_on='TeamName',right_index=True)
df_seeds['Wins'] = df_seeds['Wins'].fillna(0)  # teams with no tournament wins

def upset(x):
    # Return 'Surprise' or 'Bust' for teams in the upsets list, otherwise None
    return upsets.get(x)
df_seeds['Upset'] = df_seeds['TeamName'].apply(lambda x: upset(x))

df = pd.DataFrame(embed_tsne,columns=['factor1','factor2'])
df['TeamName'] = [str(teams_dict[int(oh_to_id[x][-4:])]) + '-' + oh_to_id[x][:4] for x in df.index]
df['Season'] = [int(oh_to_id[x][:4])for x in df.index]

df = pd.merge(left=df, right=df_seeds, how='left', on='TeamName')
df = pd.merge(left=df, right=df_agg, how='left', left_on='TeamName',right_on='T1_TeamName')

df = df[['TeamName','Season','factor1','factor2','Win','Score_diff','Seed','Wins','Upset','WonTourney']]
df.columns = ['TeamName','Season','factor1','factor2','RegWins','RegPoint_diff','Seed','TourneyWins','Upset','WonTourney']

df2020 = df[df['Season']==2020].copy()

df.dropna(inplace=True,subset=['Seed'])

df['TourneyWinsScaled'] = df['TourneyWins']/df['TourneyWins'].max()
df['SeedScaled'] = df['Seed'].astype(int)/df['Seed'].astype(int).max()

df.head()
TeamName Season factor1 factor2 RegWins RegPoint_diff Seed TourneyWins Upset WonTourney TourneyWinsScaled SeedScaled
2 Alabama-1985 1985 -67.399902 -18.627394 0.700000 7.400000 07 2.0 None False 0.333333 0.4375
8 Arizona-1985 1985 -54.920925 26.391088 0.666667 6.851852 10 0.0 None False 0.000000 0.6250
11 Arkansas-1985 1985 -61.437645 8.893435 0.636364 3.363636 09 1.0 None False 0.166667 0.5625
14 Auburn-1985 1985 -54.917221 24.073803 0.620690 3.448276 11 2.0 None False 0.333333 0.6875
21 Boston College-1985 1985 -61.193359 18.378933 0.615385 5.038462 11 2.0 None False 0.333333 0.6875

Important: For the following plots T-SNE representations of trained embeddings will be on the left and mean regular season statistics will be on the right.

#collapse_hide

axis_ranges = [[-80,75],
               [-80,75],
               [-10,30],
               [.2,1.2]]

def plot_comparison(df, colorBy, orderBy, axis_ranges):
    xrange_tsne = [-80,75]
    yrange_tsne = [-80,75]
    xrange_raw = [-10,30]
    yrange_raw = [.2,1.2]

    selector = alt.selection_single(empty='all', fields=['TeamName'])

    base = alt.Chart(df).mark_point(filled=True,size=50).encode(
        color=alt.condition(selector,
                            colorBy,
                            alt.value('lightgray') ),
        order=orderBy,
        tooltip=['TeamName','Seed']
    ).properties(
        width=250,
        height=250
    ).add_selection(selector).interactive()

    chart1 = [alt.X('factor1:Q',
                    scale=alt.Scale(domain=axis_ranges[0]),
                    axis=alt.Axis(title='T-SNE factor 1')),
            alt.Y('factor2:Q',
                    scale=alt.Scale(domain=axis_ranges[1]),
                    axis=alt.Axis(title='T-SNE factor 2'))]

    chart2 = [alt.X('RegPoint_diff:Q',
                    scale=alt.Scale(domain=axis_ranges[2]),
                    axis=alt.Axis(title='Average Regular Season Point Difference')),
            alt.Y('RegWins:Q',
                    scale=alt.Scale(domain=axis_ranges[3]),
                    axis=alt.Axis(format='%', title='Regular Season Win Percentage'))]

    return base, chart1, chart2

colorBy = alt.Color('Seed:Q', scale=alt.Scale(scheme='viridis',reverse=True))
orderBy = alt.Order('Seed:Q', sort='descending')
base, chart1, chart2 = plot_comparison( df,colorBy, orderBy, axis_ranges)

base.encode(*chart1)  | base.encode(*chart2)

Colored by seed: We see a high correlation between the assigned seed and our embeddings. Our embeddings appear to be a better representation of the seeding than the aggregated statistics, which makes sense since our method uses pair-wise comparisons and effectively accounts for team strength while aggregated statistics do not.

#collapse_hide
colorBy = alt.Color('TourneyWins:Q', scale=alt.Scale(scheme='viridis',reverse=False))
orderBy = alt.Order('TourneyWins:Q', sort='ascending')
base, chart1, chart2 = plot_comparison( df,colorBy, orderBy, axis_ranges)

base.encode(*chart1)  | base.encode(*chart2)

Colored by number of NCAA tournament games won that year: The embeddings appear to be far less correlated with the number of tournament games won. This is logical since, unlike the seeds, this statistic is not at all represented in the training set.

#collapse_hide
colorBy = alt.Color('WonTourney:N', scale=alt.Scale(scheme='tableau10'))
orderBy = alt.Order('WonTourney:N', sort='ascending')
base, chart1, chart2 = plot_comparison( df,colorBy, orderBy, axis_ranges)

base.encode(*chart1)  | base.encode(*chart2)

Tournament winners compared to the field (above): One interesting insight is how significantly different the 1985 Villanova team is from the other tournament winners. Multiple websites (like this one) list the 1985 Villanova team winning the championship as one of the greatest underdog stories ever. This is far more evident in the T-SNE representation of the embeddings than in the plots of win percentage vs. points.

#collapse_hide
colorBy = alt.Color('Upset:N', scale=alt.Scale(scheme='tableau10'))
orderBy = alt.Order('Upset:N', sort='ascending')
base, chart1, chart2 = plot_comparison( df,colorBy, orderBy, axis_ranges)

base.encode(*chart1)  | base.encode(*chart2)

Biggest upsets compiled from here and here - underdogs in red: Generally, the model agrees with the experts. These were upsets and wouldn't have been predicted by this method. If anything, this method would likely have predicted no upset with even greater conviction than a model trained on just aggregated points and wins. The only exception to this is the 1986 "upset" of Cleveland State over the Indiana Hoosiers. Both the embeddings model and the aggregated statistics indicate that Cleveland State may have been the better team. Perhaps it was an issue of name recognition that led this to be called an upset?

#collapse_hide
colorBy = alt.Color('TourneyWins:Q', scale=alt.Scale(scheme='viridis',reverse=False))
orderBy = alt.Order('TourneyWins:Q', sort='ascending')
base, chart1, chart2 = plot_comparison( df,colorBy, orderBy, axis_ranges)
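# NOTE: select_year is not defined anywhere in the listing above; a season slider is assumed here
select_year = alt.selection_single(fields=['Season'], bind=alt.binding_range(min=1985, max=2019, step=1))
base = base.add_selection(select_year).transform_filter(select_year)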

base.encode(*chart1)  | base.encode(*chart2)

Number of games won split out by season - yellow dot is tournament winner: The spread of teams is quite variable year to year. Notably, the tournament that the 1985 Villanova team won as a heavy underdog has less spread in the competition than other years.

#collapse_hide
colorBy = alt.Color('TeamName:N')
axis_ranges = [[-80,80],[-100,70],[-30,30],[0,1]]
orderBy = alt.Order('Upset:N', sort='ascending')
base, chart1, chart2 = plot_comparison( df2020,colorBy, orderBy, axis_ranges)

base.encode(*chart1)  | base.encode(*chart2)

The 2020 field: Just for fun here is a taste of what we missed in 2020!

Conclusions

The embeddings appear to have learned which teams are better and which are worse. It seems that they are a better representation of true team skill than simply aggregating the statistics used in model training (wins and point differentials). When the time comes to build a model for the 2021 March Madness Kaggle competition, I will likely return to embeddings as an advanced input feature for my final model, which will be trained on real tournament games! That will be the time to experiment with training the team embeddings on the advanced statistics included in the detailed Kaggle data set, in place of or in addition to the target features used here.
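One possible way to feed the embeddings into a later tournament model, sketched under assumptions: look up each team's vector and use the difference as the matchup features. The `embedding_features` helper, the random stand-in matrix, and the toy index pairs are hypothetical; in this notebook the matrix would come from the `team_encoding` layer and the indices from `id_to_oh`.

```python
import numpy as np

def embedding_features(embeddings, t1_idx, t2_idx):
    """Return one feature row per matchup: embedding(T1) - embedding(T2)."""
    t1_idx = np.asarray(t1_idx)
    t2_idx = np.asarray(t2_idx)
    return embeddings[t1_idx] - embeddings[t2_idx]

rng = np.random.default_rng(13)
toy_embeddings = rng.normal(size=(6, 8))   # stand-in for the trained (n_teams, 8) matrix

# Three hypothetical matchups given as row indices into the embedding matrix
t1 = [0, 2, 4]
t2 = [1, 3, 5]
features = embedding_features(toy_embeddings, t1, t2)
print(features.shape)  # (3, 8)
```

Using the difference keeps the features antisymmetric under a team swap, which matches how the network above combines the two teams; concatenating the two vectors would be another option to test.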